Data Science Project
Location Predictions of Squirrels Based on Several Characteristics

Names

Yimin Chen (yc4195)

Jiong Ma (jm5509)

Feng Yan (fy2297)

Wenjing Yang (wy2369)

Yang Yi (yy3307)

Abstract

In this project, we aimed to build a predictor that lets users estimate where squirrels with given characteristics are likely to be found in Central Park. We first cleaned the data and fitted two linear models, one for longitude and one for latitude. Shift, age, activity, reaction and sounds were retained as predictors of longitude, while primary fur color, reaction and sounds were retained as predictors of latitude. We also included a second dataset covering squirrels across the whole of NYC for comparison with Central Park, and made summary plots for both datasets to compare distributions and frequencies. We built the location predictor as an interactive Shiny app, but it applies only to Central Park because the Central Park data is not representative of the whole city.

Introduction

General introduction of the raw dataset:

The data on squirrels in Central Park, New York, were collected over a 14-day period from October 6 to October 20, 2018. Characteristics such as age and fur color were recorded, along with behaviors such as sounds and the location of each sighting.

Motivation and initial question:

Squirrels are found everywhere, and some places clearly have more squirrels than others. Is there any pattern in where they stay with respect to their color, age, activity or other features? An analysis of the squirrel census data may answer this question.

Main final goal:

The main goals of the project are to make maps from the census data and to build models and an interactive Shiny app that predict the locations of squirrels from their characteristics, so that people can use the website to look for the kinds of squirrels they like.

Inspirations:

The map-making process shown in class caught our eye because it is a clear and direct way to convey information. A website called The Squirrel Census (https://www.thesquirrelcensus.com/about) also studied the squirrels, but it does not provide any predictions, so we aimed to develop a prediction system. To attract more people to the website, we also planned interactive plots that introduce the raw dataset. The census covers only Central Park in 2018, so we looked for other datasets, such as Central Park data from other years or 2018 data from other places, to compare locations.

Method

Source:

The raw data we analyzed come from NYC Open Data. The dataset includes 3,023 observations in total, each recording some of a squirrel’s characteristics and the corresponding location.

Preliminary work:

Because the raw dataset covers only Central Park in 2018, we found two other datasets as supporting material for comparison. One contains information about squirrels not only in Central Park but across the whole of New York City. The other describes the characteristics and behaviors of different animals, including squirrels.

Data cleaning:

For data tidying and cleaning, the categorical variables were transformed to numeric ones for analysis and model building. We did not discard missing or unknown values but recoded them as 0, since there are many of them and omitting them might reduce the validity of the predictions. The dates were also cleaned.

We encoded squirrels’ activities under the “activity” column: If activity = “running”, the specific observations were recoded to “1”; If activity = “eating”, the specific observations were recoded to “2”; If activity = “foraging”, the specific observations were recoded to “3”; If activity = “climbing”, the specific observations were recoded to “4”; If activity = “chasing”, the specific observations were recoded to “5”.

We encoded squirrels’ interaction with humans under the “reaction” column: If reaction = “indifferent”, the specific observations were recoded to “1”; If reaction = “runs_from”, the specific observations were recoded to “2”; If reaction = “approaches”, the specific observations were recoded to “3”.

We encoded squirrels’ sounds under the “sounds” column: If sound = “kuks”, the specific observations were recoded to “1”; If sound = “quaas”, the specific observations were recoded to “2”; If sound = “moans”, the specific observations were recoded to “3”.

We encoded squirrels’ primary fur color under the “primary_fur_color” column: If color = “Gray”, the specific observations were recoded to “1”; If color = “Cinnamon”, the specific observations were recoded to “2”; If color = “Black”, the specific observations were recoded to “3”.

We encoded whether the sighting session of squirrels occurred in the morning or late afternoon under the “shift” column: If shift = “AM”, the specific observations were recoded to “1”; If shift = “PM”, the specific observations were recoded to “2”.

We encoded age groups of squirrels under the “age” column: If age = “Adult”, the specific observations were recoded to “1”; If age = “Juvenile”, the specific observations were recoded to “2”.
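
The recoding rules above can be sketched as a set of lookup tables. This is a minimal Python illustration, not the code used in the project; the names ENCODINGS and encode_row are hypothetical, and the sketch assumes that every value not listed above (missing, unknown, or any other category) is recoded to 0, as described for missing values and unknowns.

```python
# Lookup tables copied from the encoding rules described in the text.
# Assumption: any value not listed (missing, unknown, other) becomes 0.
ENCODINGS = {
    "activity": {"running": 1, "eating": 2, "foraging": 3, "climbing": 4, "chasing": 5},
    "reaction": {"indifferent": 1, "runs_from": 2, "approaches": 3},
    "sounds": {"kuks": 1, "quaas": 2, "moans": 3},
    "primary_fur_color": {"Gray": 1, "Cinnamon": 2, "Black": 3},
    "shift": {"AM": 1, "PM": 2},
    "age": {"Adult": 1, "Juvenile": 2},
}


def encode_row(row):
    """Return a copy of `row` with the six categorical columns recoded to integers."""
    encoded = dict(row)
    for column, codes in ENCODINGS.items():
        # dict.get falls back to 0 for missing keys, None, or unlisted labels.
        encoded[column] = codes.get(row.get(column), 0)
    return encoded


# Illustrative observation (made up, not taken from the dataset).
example = {"shift": "PM", "age": "Adult", "primary_fur_color": "Gray",
           "activity": "foraging", "reaction": None, "sounds": "?"}
print(encode_row(example))
```

Because a linear model treats these codes as numbers, the code 0 for unknown values sits on the same scale as the real categories, which is worth keeping in mind when reading the coefficients.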

Finally, we kept the “unique_squirrel_id”, “hectare”, “shift”, “date”, “heactare_squirrel_number”, “age”, “primary_fur_color”, “highlight_fur_color”, “combination_of_primary_and_highlight_color”, “location”, “lat_long”, “long”, “lat”, “activity”, “reaction”, and “sounds” columns in the tidied dataset for further data analysis and model building.

Model building process:

Since each location has both a longitude and a latitude, we fitted two linear models, one with longitude and one with latitude as the response, and combined the two predictions at the end. We built several candidate models for each response using different selection methods: p-value based selection, step-wise selection (backward and forward at the same time), criterion-based selection, and LASSO. The following explanation covers longitude only; the latitude model follows exactly the same procedure.

The first step was to put all the numeric variables into the model and check their p-values. The variables were shift + age + primary_fur_color + location + activity + reaction + sounds. Although hectare is also a numeric variable, it was not included because users of the model would not know how many squirrels there are within a specific hectare; they only know the characteristics of the squirrels they want to look for. Variables with a p-value greater than 0.05 were removed, and the model was refitted with the remaining variables to make sure all of them had p-values less than 0.05. This produced the first candidate model, with predictors ‘shift’, ‘age’, ‘activity’, ‘reaction’ and ‘sounds’.

Then we selected a model using an automatic procedure, specifically step-wise regression. Backward, forward and step-wise methods may produce different results, but we chose step-wise since it gives a single ‘best’ model. As a result, all 6 variables other than location were included, which gave the second candidate model.

Next, we used a criterion-based procedure. The model with the largest adjusted R-squared value together with the smallest AIC and BIC values was chosen as the candidate. It turned out to contain the same 6 variables as the model from the automatic procedure.
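
The three criteria can be written out explicitly. The following Python sketch uses the standard formulas for a Gaussian linear model; the names rss (residual sum of squares), tss (total sum of squares), n (number of observations) and p (number of predictors) are our own, AIC and BIC are given up to an additive constant that does not affect the ranking of models fitted to the same data, and the numbers in the example are made up for illustration rather than taken from our fits.

```python
import math


def adjusted_r2(rss, tss, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept."""
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))


def aic(rss, n, p):
    """Gaussian AIC up to an additive constant; p + 1 coefficients are counted."""
    return n * math.log(rss / n) + 2 * (p + 1)


def bic(rss, n, p):
    """Gaussian BIC up to an additive constant; p + 1 coefficients are counted."""
    return n * math.log(rss / n) + math.log(n) * (p + 1)


# Hypothetical candidates: predictor count -> residual sum of squares.
# These values are illustrative only and are NOT results from the project.
n_obs, total_ss = 3023, 1.0
candidates = {5: 0.9500, 6: 0.9499, 7: 0.9499}
for p, rss in candidates.items():
    print(p, adjusted_r2(rss, total_ss, n_obs, p), aic(rss, n_obs, p), bic(rss, n_obs, p))
```

Larger adjusted R-squared is better, while smaller AIC and BIC are better; BIC penalizes each extra predictor by log(n) instead of 2, so it favors smaller models when n is large.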

The LASSO selection method was then used. After searching for the best lambda value, the LASSO candidate retained all seven predictors, which means no variable was dropped by this procedure.

The four procedures gave us candidate ‘best’ models that are all nested within one another. We chose the final model according to 3 criteria: the adjusted R-squared value, the cross-validation outcome (i.e. RMSE) and the principle of parsimony. For longitude, the final model has 5 predictors (shift + age + activity + reaction + sounds). All the candidates had almost the same adjusted R-squared and RMSE values, so by the principle of parsimony we chose the most succinct one. The final latitude model has 3 predictors (sounds + primary_fur_color + reaction) and was also chosen for parsimony.
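
To make the cross-validation criterion concrete, here is a minimal Python sketch of k-fold cross-validated RMSE. It is an illustration under simplifying assumptions, not the procedure we ran: the helper names (kfold_rmse, fit_line, predict_line) are hypothetical, folds are assigned by position rather than at random, the example model has a single predictor for brevity, and the data are made up. It also assumes there are at least k observations.

```python
import math


def kfold_rmse(xs, ys, fit, predict, k=5):
    """Average held-out RMSE over k folds (fold i holds rows i, i + k, i + 2k, ...)."""
    n = len(xs)
    fold_rmses = []
    for i in range(k):
        test_idx = set(range(i, n, k))
        train_x = [x for j, x in enumerate(xs) if j not in test_idx]
        train_y = [y for j, y in enumerate(ys) if j not in test_idx]
        model = fit(train_x, train_y)
        sq_errors = [(predict(model, xs[j]) - ys[j]) ** 2 for j in sorted(test_idx)]
        fold_rmses.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return sum(fold_rmses) / k


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x


# Made-up data for illustration: y is exactly linear in x.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
print(kfold_rmse(xs, ys, fit_line, predict_line, k=5))
```

When two candidate models give nearly the same cross-validated RMSE, as ours did, the extra predictors in the larger model are not improving out-of-sample accuracy, which is the argument for choosing the smaller one.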

Statistical tests:

We used ANOVA tests to check that the predictors in our final longitude and latitude regression fits are significant. For the longitude model, the p-values of all five predictors (shift, age, activity, reaction and sounds) are smaller than 0.05, so we reject the null hypothesis and conclude that every predictor in that model is significant. Likewise, for the latitude model, the p-values of all three predictors (primary fur color, reaction and sounds) are smaller than 0.05, so every predictor in that model is significant as well.

Shiny

Our first Shiny app, an interactive squirrel tracker map, allows users to filter squirrels by certain characteristics. Users may also explore summarized information about the filtered squirrels in the data table below the map and look for a particular one via the search box.

Our second Shiny app, a squirrel location predictor map, helps users determine the most likely location of a squirrel based on specified features. The function that produces the answer depends on the models we chose and built. To learn more about our methodology, refer to “Model building process”.

Results

Data summaries:

In addition to the Central Park squirrel data, we found another dataset covering squirrels across the whole of NYC, so that we could check whether squirrels in Central Park differ from those elsewhere in New York.

The first graph we drew was ‘Number of Observations’ vs. ‘Time of Day’, with morning and afternoon observations separated. It showed that squirrels tended to be more active in the afternoon or evening. A limitation of the data is that we only know whether a sighting was in the AM or PM shift, not the exact time. We assume the squirrels were seen before sunset, since they should be busy collecting food while there is sunlight.

The second graph we drew was ‘Number of Observations’ vs. ‘Primary Fur Color’ for each day. The number of observations varied from day to day with no clear pattern. Squirrels appeared most active on Oct. 7 and Oct. 13 and clearly less active in the last few days. Overall, gray squirrels were the most numerous and black ones the fewest. Cinnamon squirrels were also observed fairly often, along with some whose color was not identified.

The third graph we drew was a pie chart showing the distribution of squirrels by age. The majority (88.6%) were adults and the remaining 11.4% were juveniles. We are not sure how the observers determined age, perhaps by size. A limitation is that only ‘adult’ and ‘juvenile’ were recorded; the predictions might be more valid if other stages such as ‘baby’ or ‘old’ were available.

The fourth graph we drew showed the distribution of adult squirrels by primary fur color. The majority (83.6%) of adult squirrels were gray, 12.8% were cinnamon, and the remaining 3.62% were black.

The fifth graph we drew showed the distribution of juvenile squirrels by primary fur color. The distribution was similar to that of the adults: 79.5% of the juvenile squirrels were gray, 18% were cinnamon, and the remaining 2.5% were black.

The sixth graph we drew showed squirrel activities by primary fur color. Regardless of fur color, squirrels foraged the most frequently and chased the least frequently, which makes sense because squirrels need to store food for the cold months.

The last graph we drew for the Central Park dataset showed how the distribution of activities differs by location. For example, climbing happened above ground most of the time, while foraging, running and eating mostly happened on the ground plane. Chasing was about equally likely above ground and on the ground plane.

Then comes the analysis of NYC squirrels.

The first graph we drew for the NYC data showed squirrel activities by primary fur color. Restricting to the activities recorded in the Central Park dataset, gray squirrels climbed the most frequently and chased the least frequently. Black and cinnamon squirrels foraged the most and chased the least.

The second graph for the NYC data showed how the distribution of activities differs by location. Eating, foraging and running mostly happened on the ground plane, while chasing and climbing happened above ground most of the time.

The interactive map shows the distribution of squirrels by primary fur color. Users can zoom in or out to find clusters of squirrels.

Main model:

Longitude and latitude are predicted with two separate functions since they have different predictors.

The longitude function has 5 predictors (shift + age + activity + reaction + sounds). Its formula is: longitude = -73.97 + 0.0005639 * shift - 0.0009412 * age - 0.0002267 * activity + 0.0004809 * reaction + 0.002142 * sounds

The latitude function has 3 predictors (sounds + primary_fur_color + reaction). Its formula is: latitude = 40.781005 - 0.0013140 * primary_fur_color + 0.0007784 * reaction + 0.0029074 * sounds
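
The two formulas can be combined into a single function that returns a point. The Python sketch below copies the coefficients from the formulas above and expects the integer codes from the data cleaning step as inputs; the function name predict_location is hypothetical, and this is an illustration rather than the code behind the Shiny app.

```python
def predict_location(shift, age, primary_fur_color, activity, reaction, sounds):
    """Return (longitude, latitude) for one squirrel from its encoded characteristics."""
    # Longitude model: shift + age + activity + reaction + sounds.
    longitude = (-73.97
                 + 0.0005639 * shift
                 - 0.0009412 * age
                 - 0.0002267 * activity
                 + 0.0004809 * reaction
                 + 0.002142 * sounds)
    # Latitude model: primary_fur_color + reaction + sounds.
    latitude = (40.781005
                - 0.0013140 * primary_fur_color
                + 0.0007784 * reaction
                + 0.0029074 * sounds)
    return longitude, latitude


# Example: AM shift (1), adult (1), gray (1), foraging (3), indifferent (1), no sound recorded (0).
print(predict_location(shift=1, age=1, primary_fur_color=1, activity=3, reaction=1, sounds=0))
```

Given the ranges of the codes, every prediction stays within about 0.011 degrees of the two intercepts, so the predicted points all fall in a small area around (-73.97, 40.781005).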

Comparisons:

We compared the squirrel data for Central Park alone with the data for the whole of NYC.

We found that squirrels in Central Park, regardless of color, foraged the most and chased the least. However, squirrels of different colors behaved differently across NYC: gray squirrels climbed the most, while black and cinnamon squirrels foraged the most frequently.

As for activities by location, Central Park and the whole of NYC follow basically the same trend: eating, foraging and running were most likely on the ground plane, and the other activities happened above ground. Chasing was slightly different. In Central Park, chasing on the ground plane was slightly more likely than above ground, while for the whole of NYC about half of the chasing was above ground.

Discussion

Findings:

In this project, we found that the behavior of squirrels in Central Park was not exactly the same as that of squirrels in other parts of New York, so the prediction model may not be representative enough to apply to all squirrels in NYC. The prediction Shiny app we made is therefore tentatively useful only to users who want to look for specific squirrels in Central Park.

We also found that the predictors of longitude and latitude are different. This was not unexpected: for example, some squirrels might prefer colder places, which would correspond to higher latitudes but would have nothing to do with longitude.

Conclusion

We used tidied Central Park data to build prediction models and an app for both the longitude and latitude of squirrels’ locations. Data from outside Central Park was also analyzed and did not follow the same trends, so the predictions apply only to Central Park. The squirrel tracker and map allow people to locate the squirrels. In future work, we may collect data from years other than 2018 to compare the prediction models.